Quickly identifying identical and closely related subjects in large databases using genotype data
Abstract
Genome-wide association studies (GWAS) usually rely on the assumption that the samples do not come from closely related individuals. Detecting duplicates and close relatives becomes harder, both statistically and computationally, when the datasets to be combined may have been genotyped on different platforms. The dbGaP repository at the National Center for Biotechnology Information (NCBI) contains datasets from hundreds of studies with over one million samples. It includes many duplicates and closely related individuals, both within studies and across studies from different submitters, and the submitters of individual datasets cannot always identify relationships between studies. To aid in the curation of dbGaP, we developed a rapid statistical method called Genetic Relationship and Fingerprinting (GRAF) that detects duplicates and closely related samples even when the sets of genotyped markers differ and the DNA strand orientations are unknown. GRAF extracts the genotypes of 10,000 informative and independent SNPs from genotype datasets obtained using different methods, and implements fast algorithms that allow it to find all duplicate pairs among more than 880,000 samples within and across dbGaP studies in less than two hours. In addition, GRAF uses two statistical metrics, the All Genotype Mismatch Rate (AGMR) and the Homozygous Genotype Mismatch Rate (HGMR), to determine subject relationships directly from the observed genotypes, without estimating probabilities of identity by descent (IBD) or kinship coefficients, and compares the predicted relationships with those reported in the pedigree files. We implemented GRAF in a freely available C++ program of the same name. In this paper, we describe the methods in GRAF and validate its use on samples from the dbGaP repository. Other scientists can use GRAF on their own samples and in combination with samples downloaded from dbGaP.
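The abstract names the two mismatch metrics without defining them. The sketch below is an illustration rather than the GRAF implementation (which is written in C++). It assumes AGMR is the fraction of mismatched genotypes among SNPs called in both samples, and HGMR is the fraction of mismatches among SNPs where both samples are homozygous. The function name `mismatch_rates` and the 0/1/2 allele-count coding (with `None` for a missing call) are hypothetical choices made for this example, and the genotypes are assumed to be already coded against the same reference allele.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed coding: 0 or 2 = homozygous, 1 = heterozygous, None = missing call.
Genotype = Optional[int]


def mismatch_rates(
    a: Sequence[Genotype], b: Sequence[Genotype]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (AGMR, HGMR) for two samples typed on the same ordered SNP list.

    A rate is None when no SNP qualifies for its denominator.
    """
    if len(a) != len(b):
        raise ValueError("both samples must cover the same SNP list")

    both_called = 0   # SNPs with a genotype call in both samples
    any_mismatch = 0  # of those, SNPs where the genotypes differ
    both_hom = 0      # SNPs where both samples are homozygous
    hom_mismatch = 0  # of those, SNPs with opposite homozygous genotypes

    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip SNPs missing in either sample
        both_called += 1
        if ga != gb:
            any_mismatch += 1
        if ga != 1 and gb != 1:
            both_hom += 1
            if ga != gb:
                hom_mismatch += 1

    agmr = any_mismatch / both_called if both_called else None
    hgmr = hom_mismatch / both_hom if both_hom else None
    return agmr, hgmr


if __name__ == "__main__":
    sample_a: List[Genotype] = [0, 1, 2, 2, None, 0]
    sample_b: List[Genotype] = [0, 1, 0, 1, 2, 0]
    print(mismatch_rates(sample_a, sample_b))
```

Under these assumed definitions, a duplicate pair should give both rates near zero, limited only by genotyping error. A parent and child always share at least one allele at each SNP, so their HGMR should also be near zero while their AGMR is clearly positive, which is why the two rates together can separate relationship types.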